Project "Telecom: Identifying Ineffective Operators":

The virtual telephony service CallMeMaybe is developing a new function that will give supervisors information on the least effective operators. An operator is considered ineffective if they have a large number of missed incoming calls (internal and external) and a long waiting time for incoming calls. Moreover, if an operator is supposed to make outgoing calls, a small number of them is also a sign of ineffectiveness.

In this research we will study the available data on calls made by the operators in order to determine which operators may be considered effective and which ineffective. We will also perform statistical significance tests to find out whether there is a statistically significant difference between the two groups in terms of the number of calls made per day by one operator and in terms of the share of internal calls in all calls made by an operator.

Step 1: Reading the dataset and looking at the general information

In [1]:
# importing libraries

import pandas as pd
from pandas import DataFrame
import numpy as np
! pip install missingno -U -q
import missingno as msno
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
! pip install plotly -U -q
import plotly.express as px
import plotly.figure_factory as ff
from plotly import graph_objects as go
import re
import math as mth
from scipy import stats as st
from scipy.stats import mannwhitneyu
from IPython.display import display
import sys
import warnings
if not sys.warnoptions:
    warnings.simplefilter("ignore")
from datetime import datetime
In [2]:
# importing dataset

try:
    calls = pd.read_csv('/Users/pavellugovoy/Desktop/data_analysis/final_project/main_project/telecom_dataset_us.csv')
    clients = pd.read_csv('/Users/pavellugovoy/Desktop/data_analysis/final_project/main_project/telecom_clients_us.csv')
    
except:
    calls = pd.read_csv('/datasets/telecom_dataset_us.csv')
    clients = pd.read_csv('/datasets/telecom_clients_us.csv')
In [3]:
# looking at the general information of 'calls' dataset
calls.info()
display(calls.head(10))
display(calls.sample(10))
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53902 entries, 0 to 53901
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   user_id              53902 non-null  int64  
 1   date                 53902 non-null  object 
 2   direction            53902 non-null  object 
 3   internal             53785 non-null  object 
 4   operator_id          45730 non-null  float64
 5   is_missed_call       53902 non-null  bool   
 6   calls_count          53902 non-null  int64  
 7   call_duration        53902 non-null  int64  
 8   total_call_duration  53902 non-null  int64  
dtypes: bool(1), float64(1), int64(4), object(3)
memory usage: 3.3+ MB
user_id date direction internal operator_id is_missed_call calls_count call_duration total_call_duration
0 166377 2019-08-04 00:00:00+03:00 in False NaN True 2 0 4
1 166377 2019-08-05 00:00:00+03:00 out True 880022.0 True 3 0 5
2 166377 2019-08-05 00:00:00+03:00 out True 880020.0 True 1 0 1
3 166377 2019-08-05 00:00:00+03:00 out True 880020.0 False 1 10 18
4 166377 2019-08-05 00:00:00+03:00 out False 880022.0 True 3 0 25
5 166377 2019-08-05 00:00:00+03:00 out False 880020.0 False 2 3 29
6 166377 2019-08-05 00:00:00+03:00 out False 880020.0 True 8 0 50
7 166377 2019-08-05 00:00:00+03:00 in False NaN True 6 0 35
8 166377 2019-08-05 00:00:00+03:00 out False 880020.0 True 8 0 50
9 166377 2019-08-06 00:00:00+03:00 in False NaN True 4 0 62
user_id date direction internal operator_id is_missed_call calls_count call_duration total_call_duration
11753 166708 2019-10-24 00:00:00+03:00 in False NaN True 8 0 314
1745 166405 2019-11-28 00:00:00+03:00 out False 939474.0 False 47 6841 7497
25882 167150 2019-09-25 00:00:00+03:00 out True 905542.0 False 1 43 48
25380 167125 2019-10-09 00:00:00+03:00 out False 902748.0 True 93 0 2292
25712 167150 2019-09-12 00:00:00+03:00 out False 905542.0 False 23 1000 1244
16262 166916 2019-10-04 00:00:00+03:00 in False 906396.0 False 1 87 113
19876 166997 2019-09-13 00:00:00+03:00 out False 903318.0 False 16 6747 7007
44287 168025 2019-10-21 00:00:00+03:00 out False 938080.0 True 33 0 739
39482 167742 2019-11-22 00:00:00+03:00 out False 955032.0 False 2 29 80
39808 167756 2019-11-03 00:00:00+03:00 in False 938896.0 False 11 712 791
In [4]:
# looking at the metrics which can be generally evaluated with the describe() method

calls.describe()
Out[4]:
user_id operator_id calls_count call_duration total_call_duration
count 53902.000000 45730.000000 53902.000000 53902.000000 53902.000000
mean 167295.344477 916535.993002 16.451245 866.684427 1157.133297
std 598.883775 21254.123136 62.917170 3731.791202 4403.468763
min 166377.000000 879896.000000 1.000000 0.000000 0.000000
25% 166782.000000 900788.000000 1.000000 0.000000 47.000000
50% 167162.000000 913938.000000 4.000000 38.000000 210.000000
75% 167819.000000 937708.000000 12.000000 572.000000 902.000000
max 168606.000000 973286.000000 4817.000000 144395.000000 166155.000000
In [5]:
# looking at the general information of 'clients' dataset
clients.info()
display(clients.head(10))
display(clients.sample(10))
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 732 entries, 0 to 731
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   user_id      732 non-null    int64 
 1   tariff_plan  732 non-null    object
 2   date_start   732 non-null    object
dtypes: int64(1), object(2)
memory usage: 17.3+ KB
user_id tariff_plan date_start
0 166713 A 2019-08-15
1 166901 A 2019-08-23
2 168527 A 2019-10-29
3 167097 A 2019-09-01
4 168193 A 2019-10-16
5 167764 A 2019-09-30
6 167050 A 2019-08-29
7 168252 A 2019-10-17
8 168495 A 2019-10-28
9 167879 A 2019-10-03
user_id tariff_plan date_start
656 166534 B 2019-08-07
250 167571 C 2019-09-20
405 166761 C 2019-08-19
504 168562 B 2019-10-30
47 166934 A 2019-08-26
154 166782 C 2019-08-19
13 168354 A 2019-10-23
96 168529 C 2019-10-29
90 167325 C 2019-09-11
222 167943 C 2019-10-07

Conclusions

We have taken a first look at the datasets we have on hand, and we see that the dataset on operators' activity (we named it 'calls') contains the following information:

  • 'user_id' — client account ID;
  • 'date' — date the statistics were retrieved;
  • 'direction' — call direction ('out' for outgoing, 'in' for incoming);
  • 'internal' — whether the call was internal (between a client's operators);
  • 'operator_id' — operator identifier;
  • 'is_missed_call' — whether the call was missed;
  • 'calls_count' — number of calls;
  • 'call_duration' — call duration (excluding waiting time);
  • 'total_call_duration' — call duration (including waiting time).

The other dataset on the service's clients (we named it 'clients') contains the data on:

  • 'user_id';
  • 'tariff_plan' — client's current plan;
  • 'date_start' — client's registration date.

We have also seen that the datasets have some problems:

  1. there are missing values in the columns of the table 'calls' containing information on whether the call was internal ('internal') and on the operator's id ('operator_id');
  2. some of the data are of an incorrect data type: the values which represent dates are of 'object' type, which is obviously wrong; moreover, the values of 'operator_id' in the table 'calls' look like integers but are stored as floats.

We now move on to data preprocessing, where we will handle the missing values, change the inappropriate data types and check the data for duplicates.

Step 2: Preprocessing the data

2.1. Checking the data for missing values (including "hidden" ones) and preprocessing them.

In [6]:
# finding the number and percentage of missing values in 'calls'
report = calls.isna().sum().to_frame()
report = report.rename(columns = {0: 'missing_values'})
report['% of total'] = (report['missing_values'] / calls.shape[0]).round(2)
report.sort_values(by = 'missing_values', ascending = False)
Out[6]:
missing_values % of total
operator_id 8172 0.15
internal 117 0.00
user_id 0 0.00
date 0 0.00
direction 0 0.00
is_missed_call 0 0.00
calls_count 0 0.00
call_duration 0 0.00
total_call_duration 0 0.00
In [7]:
# visualizing missing values with missingno library tool
msno.bar(calls)
plt.title ("Visualisation of missing values and non-missing values (*non-missing values are dark)", fontsize = 20)
plt.suptitle("")
plt.show()

Missing values in the column 'operator_id' of the table 'calls' make up 15% of all values. That is quite a lot, but we have to remove the rows with these missing values: we cannot fill them because they are "missing at random", and almost all of our further analysis focuses on the operators' individual performance, so this data is mandatory and the analysis cannot be carried out without it.

As for the missing values in the column 'internal' of the table 'calls', they are also "missing at random", but the good news is that their number is insignificant, so we can remove them as well without the risk of losing much information.
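One quick way to sanity-check the "missing at random" assumption is to compare the missingness rate across groups, e.g. by call direction. A minimal sketch on toy data (the values below are made up, not taken from the dataset):

```python
import pandas as pd

# toy frame mirroring the relevant 'calls' columns (hypothetical values)
df = pd.DataFrame({
    'direction': ['in', 'in', 'out', 'out', 'in'],
    'operator_id': [None, 1.0, 2.0, None, 3.0],
})

# share of rows with a missing operator_id, split by call direction
missing_by_dir = df['operator_id'].isna().groupby(df['direction']).mean()
```

If the shares differ a lot between groups, the missingness is related to the call direction rather than purely random.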

In [8]:
# removing rows with missing values 

calls.dropna(inplace=True)

# checking the results

calls.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 45670 entries, 1 to 53900
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   user_id              45670 non-null  int64  
 1   date                 45670 non-null  object 
 2   direction            45670 non-null  object 
 3   internal             45670 non-null  object 
 4   operator_id          45670 non-null  float64
 5   is_missed_call       45670 non-null  bool   
 6   calls_count          45670 non-null  int64  
 7   call_duration        45670 non-null  int64  
 8   total_call_duration  45670 non-null  int64  
dtypes: bool(1), float64(1), int64(4), object(3)
memory usage: 3.2+ MB

Now we are going to check all other columns for "hidden" missing values

checking 'user_id'
In [9]:
display(calls['user_id'].unique())
display(calls['user_id'].value_counts())
array([166377, 166391, 166392, 166399, 166405, 166406, 166407, 166428,
       166481, 166482, 166485, 166487, 166495, 166503, 166507, 166511,
       166520, 166533, 166536, 166541, 166548, 166582, 166585, 166604,
       166609, 166619, 166636, 166649, 166658, 166660, 166666, 166669,
       166671, 166677, 166678, 166680, 166688, 166691, 166692, 166697,
       166708, 166711, 166713, 166717, 166725, 166726, 166727, 166732,
       166774, 166782, 166800, 166803, 166812, 166813, 166833, 166836,
       166837, 166839, 166843, 166874, 166879, 166884, 166896, 166899,
       166901, 166908, 166916, 166936, 166939, 166940, 166941, 166946,
       166955, 166958, 166971, 166973, 166974, 166975, 166983, 166985,
       166993, 166997, 167002, 167003, 167004, 167009, 167011, 167012,
       167016, 167031, 167035, 167052, 167053, 167059, 167060, 167061,
       167071, 167078, 167079, 167082, 167102, 167109, 167110, 167112,
       167113, 167122, 167124, 167125, 167130, 167139, 167141, 167144,
       167150, 167158, 167162, 167172, 167175, 167176, 167180, 167183,
       167185, 167188, 167189, 167199, 167200, 167235, 167238, 167242,
       167244, 167246, 167250, 167263, 167264, 167272, 167275, 167277,
       167285, 167286, 167288, 167338, 167359, 167364, 167397, 167402,
       167412, 167436, 167445, 167451, 167455, 167456, 167466, 167471,
       167474, 167475, 167479, 167487, 167497, 167506, 167521, 167530,
       167532, 167533, 167534, 167543, 167545, 167551, 167575, 167580,
       167599, 167621, 167626, 167630, 167634, 167644, 167645, 167646,
       167650, 167653, 167654, 167664, 167682, 167715, 167716, 167727,
       167742, 167744, 167747, 167756, 167759, 167781, 167792, 167799,
       167805, 167816, 167819, 167827, 167828, 167829, 167840, 167847,
       167852, 167870, 167888, 167906, 167908, 167923, 167927, 167930,
       167936, 167947, 167948, 167949, 167955, 167958, 167969, 167976,
       167977, 167989, 167994, 167995, 167998, 167999, 168000, 168004,
       168008, 168010, 168016, 168018, 168019, 168021, 168025, 168041,
       168042, 168047, 168048, 168054, 168062, 168064, 168073, 168090,
       168091, 168093, 168097, 168101, 168105, 168110, 168121, 168128,
       168154, 168155, 168180, 168185, 168187, 168190, 168193, 168202,
       168221, 168225, 168228, 168252, 168253, 168264, 168271, 168275,
       168291, 168295, 168303, 168307, 168315, 168336, 168361, 168366,
       168377, 168412, 168416, 168459, 168464, 168466, 168473, 168481,
       168500, 168522, 168562, 168575, 168579, 168583, 168598, 168601,
       168603, 168606])
168187    2497
166658    1994
167497    1321
167626    1264
166916    1213
          ... 
167364       1
168090       1
167139       1
166548       1
166481       1
Name: user_id, Length: 290, dtype: int64

They look normal

checking 'date'
In [10]:
display(calls['date'].unique())
display(calls['date'].value_counts())
array(['2019-08-05 00:00:00+03:00', '2019-08-06 00:00:00+03:00',
       '2019-08-07 00:00:00+03:00', '2019-08-08 00:00:00+03:00',
       '2019-08-09 00:00:00+03:00', '2019-08-12 00:00:00+03:00',
       '2019-08-13 00:00:00+03:00', '2019-08-14 00:00:00+03:00',
       '2019-08-15 00:00:00+03:00', '2019-08-16 00:00:00+03:00',
       '2019-08-19 00:00:00+03:00', '2019-08-20 00:00:00+03:00',
       '2019-08-21 00:00:00+03:00', '2019-08-22 00:00:00+03:00',
       '2019-08-23 00:00:00+03:00', '2019-08-26 00:00:00+03:00',
       '2019-08-27 00:00:00+03:00', '2019-08-28 00:00:00+03:00',
       '2019-08-29 00:00:00+03:00', '2019-08-30 00:00:00+03:00',
       '2019-09-02 00:00:00+03:00', '2019-09-03 00:00:00+03:00',
       '2019-09-04 00:00:00+03:00', '2019-09-05 00:00:00+03:00',
       '2019-09-06 00:00:00+03:00', '2019-09-09 00:00:00+03:00',
       '2019-09-10 00:00:00+03:00', '2019-09-11 00:00:00+03:00',
       '2019-09-12 00:00:00+03:00', '2019-09-13 00:00:00+03:00',
       '2019-09-16 00:00:00+03:00', '2019-09-17 00:00:00+03:00',
       '2019-09-18 00:00:00+03:00', '2019-09-19 00:00:00+03:00',
       '2019-09-20 00:00:00+03:00', '2019-09-23 00:00:00+03:00',
       '2019-09-24 00:00:00+03:00', '2019-09-25 00:00:00+03:00',
       '2019-09-26 00:00:00+03:00', '2019-09-27 00:00:00+03:00',
       '2019-09-30 00:00:00+03:00', '2019-10-01 00:00:00+03:00',
       '2019-10-02 00:00:00+03:00', '2019-10-03 00:00:00+03:00',
       '2019-10-04 00:00:00+03:00', '2019-10-07 00:00:00+03:00',
       '2019-10-08 00:00:00+03:00', '2019-10-09 00:00:00+03:00',
       '2019-10-10 00:00:00+03:00', '2019-10-11 00:00:00+03:00',
       '2019-10-14 00:00:00+03:00', '2019-10-15 00:00:00+03:00',
       '2019-10-16 00:00:00+03:00', '2019-10-17 00:00:00+03:00',
       '2019-10-18 00:00:00+03:00', '2019-10-21 00:00:00+03:00',
       '2019-10-22 00:00:00+03:00', '2019-10-23 00:00:00+03:00',
       '2019-10-24 00:00:00+03:00', '2019-10-25 00:00:00+03:00',
       '2019-10-28 00:00:00+03:00', '2019-10-29 00:00:00+03:00',
       '2019-10-30 00:00:00+03:00', '2019-10-31 00:00:00+03:00',
       '2019-11-01 00:00:00+03:00', '2019-11-05 00:00:00+03:00',
       '2019-11-06 00:00:00+03:00', '2019-11-07 00:00:00+03:00',
       '2019-11-08 00:00:00+03:00', '2019-11-11 00:00:00+03:00',
       '2019-11-12 00:00:00+03:00', '2019-11-13 00:00:00+03:00',
       '2019-11-14 00:00:00+03:00', '2019-11-15 00:00:00+03:00',
       '2019-11-18 00:00:00+03:00', '2019-11-19 00:00:00+03:00',
       '2019-11-20 00:00:00+03:00', '2019-11-21 00:00:00+03:00',
       '2019-11-22 00:00:00+03:00', '2019-11-25 00:00:00+03:00',
       '2019-11-26 00:00:00+03:00', '2019-11-27 00:00:00+03:00',
       '2019-11-28 00:00:00+03:00', '2019-08-17 00:00:00+03:00',
       '2019-08-18 00:00:00+03:00', '2019-08-24 00:00:00+03:00',
       '2019-08-25 00:00:00+03:00', '2019-08-31 00:00:00+03:00',
       '2019-09-01 00:00:00+03:00', '2019-09-07 00:00:00+03:00',
       '2019-09-08 00:00:00+03:00', '2019-09-14 00:00:00+03:00',
       '2019-09-15 00:00:00+03:00', '2019-09-21 00:00:00+03:00',
       '2019-09-22 00:00:00+03:00', '2019-09-28 00:00:00+03:00',
       '2019-10-05 00:00:00+03:00', '2019-10-27 00:00:00+03:00',
       '2019-11-04 00:00:00+03:00', '2019-11-09 00:00:00+03:00',
       '2019-11-10 00:00:00+03:00', '2019-11-16 00:00:00+03:00',
       '2019-11-23 00:00:00+03:00', '2019-11-24 00:00:00+03:00',
       '2019-08-02 00:00:00+03:00', '2019-08-03 00:00:00+03:00',
       '2019-08-04 00:00:00+03:00', '2019-08-10 00:00:00+03:00',
       '2019-10-06 00:00:00+03:00', '2019-10-19 00:00:00+03:00',
       '2019-10-20 00:00:00+03:00', '2019-10-12 00:00:00+03:00',
       '2019-11-02 00:00:00+03:00', '2019-11-17 00:00:00+03:00',
       '2019-10-13 00:00:00+03:00', '2019-10-26 00:00:00+03:00',
       '2019-11-03 00:00:00+03:00', '2019-09-29 00:00:00+03:00'],
      dtype=object)
2019-11-25 00:00:00+03:00    1083
2019-11-28 00:00:00+03:00    1061
2019-11-27 00:00:00+03:00    1058
2019-11-21 00:00:00+03:00    1055
2019-11-22 00:00:00+03:00    1048
                             ... 
2019-08-05 00:00:00+03:00      14
2019-08-03 00:00:00+03:00       6
2019-08-02 00:00:00+03:00       6
2019-08-10 00:00:00+03:00       4
2019-08-04 00:00:00+03:00       3
Name: date, Length: 118, dtype: int64

They also look fine

checking 'direction'
In [11]:
display(calls['direction'].unique())
array(['out', 'in'], dtype=object)

We see no problem here

checking 'is_missed_call'
In [12]:
display(calls['is_missed_call'].unique())
display(calls['is_missed_call'].value_counts())
array([ True, False])
False    30153
True     15517
Name: is_missed_call, dtype: int64

No problem

checking 'calls_count'
In [13]:
display(calls['calls_count'].describe())
count    45670.000000
mean        16.925203
std         59.081220
min          1.000000
25%          1.000000
50%          4.000000
75%         14.000000
max       4817.000000
Name: calls_count, dtype: float64

There is a problem here as well. It is not possible to make 4,817 calls in a day. Probably there was also a technical issue at the data extraction step, and some calls were counted multiple times. Let's look at how the data is distributed more precisely.

In [14]:
fig = px.histogram(calls, x="calls_count", 
                   marginal="box", 
                   hover_data=calls.columns)

fig.update_layout(
    title = 'Distribution of the number of calls per day',
    xaxis_title = "Number of calls",
    yaxis_title = "Number of entries in the dataset",
    font=dict(
        family="Arial",
        size=12,
        color="RebeccaPurple"
    )
    )
fig.show()

We see that beyond roughly 120 calls the bins contain very few entries. We may agree that it is possible to make 120 calls per day; anything above that is highly improbable, so to be on the safe side we will remove such rows, since this data is very likely corrupted.

checking 'call_duration'
In [15]:
display(calls['call_duration'].describe())
count     45670.000000
mean       1010.934399
std        4017.101865
min           0.000000
25%           0.000000
50%         107.000000
75%         774.000000
max      144395.000000
Name: call_duration, dtype: float64
In [16]:
# plotting a histogram to see distribution of such values

fig = px.histogram(calls, x="call_duration", 
                   marginal="box", 
                   hover_data=calls.columns)

fig.update_layout(
    title = 'Distribution of values related to "call_duration" ',
    xaxis_title = "call duration",
    yaxis_title = "Number of entries in the dataset",
    font=dict(
        family="Arial",
        size=12,
        color="RebeccaPurple"
    )
    )
fig.show()
checking 'total_call_duration'
In [17]:
display(calls['total_call_duration'].describe())
count     45670.000000
mean       1323.697000
std        4735.336635
min           0.000000
25%          68.000000
50%         291.000000
75%        1112.000000
max      166155.000000
Name: total_call_duration, dtype: float64
In [18]:
# plotting a histogram to see distribution of such values

fig = px.histogram(calls, x="total_call_duration", 
                   marginal="box", 
                   hover_data=calls.columns)

fig.update_layout(
    title = 'Distribution of values related to "total_call_duration" ',
    xaxis_title = "total call duration",
    yaxis_title = "Number of entries in the dataset",
    font=dict(
        family="Arial",
        size=12,
        color="RebeccaPurple"
    )
    )
fig.show()
In [19]:
# looking at the rows with extra long duration 

calls.query('total_call_duration > 28800').head(10)
Out[19]:
user_id date direction internal operator_id is_missed_call calls_count call_duration total_call_duration
6579 166582 2019-08-13 00:00:00+03:00 out False 885890.0 False 308 35126 40739
6581 166582 2019-08-13 00:00:00+03:00 out False 885876.0 False 265 30422 35448
6592 166582 2019-08-14 00:00:00+03:00 out False 885890.0 False 230 30890 35295
6599 166582 2019-08-15 00:00:00+03:00 out False 885890.0 False 417 42258 49942
6609 166582 2019-08-16 00:00:00+03:00 out False 885890.0 False 517 57988 67232
6628 166582 2019-08-19 00:00:00+03:00 out False 885876.0 False 343 36898 43738
6630 166582 2019-08-19 00:00:00+03:00 out False 885890.0 False 546 59043 69599
6637 166582 2019-08-20 00:00:00+03:00 out False 885876.0 False 270 29512 34819
6638 166582 2019-08-20 00:00:00+03:00 out False 885890.0 False 387 46077 52888
6647 166582 2019-08-21 00:00:00+03:00 out False 885890.0 False 511 66966 76310

It looks like there is a problem with these values: there are many outliers which are quite weird. Something probably went wrong when the data was extracted.

A normal working day (in most countries) is 8 hours, which is 28,800 seconds in total. So that is the maximum time an operator could spend on calls. Of course, some operators may work extra hours, but they also need rest, and there is always some downtime.

So we assume that an 8-hour period is an appropriate limit to use as the fence in order to keep only the most plausible data. Because the other values in these rows (for example, the number of calls) are also wrong, the most reasonable way to handle such data is simply to remove it.

In [20]:
# filtering the data 

calls = calls.query('call_duration <= 28800')
calls = calls.query('total_call_duration <= 28800')
calls = calls.query('calls_count < 120')
calls = calls.reset_index(drop=True)

# checking the results

calls.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44903 entries, 0 to 44902
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   user_id              44903 non-null  int64  
 1   date                 44903 non-null  object 
 2   direction            44903 non-null  object 
 3   internal             44903 non-null  object 
 4   operator_id          44903 non-null  float64
 5   is_missed_call       44903 non-null  bool   
 6   calls_count          44903 non-null  int64  
 7   call_duration        44903 non-null  int64  
 8   total_call_duration  44903 non-null  int64  
dtypes: bool(1), float64(1), int64(4), object(3)
memory usage: 2.8+ MB

2.2. Changing the inappropriate datatypes.

In [21]:
calls.head()
Out[21]:
user_id date direction internal operator_id is_missed_call calls_count call_duration total_call_duration
0 166377 2019-08-05 00:00:00+03:00 out True 880022.0 True 3 0 5
1 166377 2019-08-05 00:00:00+03:00 out True 880020.0 True 1 0 1
2 166377 2019-08-05 00:00:00+03:00 out True 880020.0 False 1 10 18
3 166377 2019-08-05 00:00:00+03:00 out False 880022.0 True 3 0 25
4 166377 2019-08-05 00:00:00+03:00 out False 880020.0 False 2 3 29
Converting data type of values of the column 'date' in the table 'calls' from object into datetime
In [22]:
# converting 'object' into 'datetime'
calls['date'] = pd.to_datetime(calls['date'])

# keeping only the date part of each value
calls['date'] = calls['date'].dt.date

# converting back into 'datetime'
calls['date'] = pd.to_datetime(calls['date'])
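The three steps above can also be done in a single chain. A sketch on a small hand-made series (assuming all timestamps share the same UTC offset, as they do in this dataset):

```python
import pandas as pd

s = pd.Series(['2019-08-05 00:00:00+03:00', '2019-08-06 12:30:00+03:00'])

# parse, drop the UTC offset (keeping the wall-clock time),
# then truncate the time part to midnight
dates = pd.to_datetime(s).dt.tz_localize(None).dt.normalize()
```

This avoids the intermediate pass through Python `date` objects and keeps the column as `datetime64[ns]` throughout.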
Converting values in 'internal' column in the table 'calls' from 'object' into 'boolean' format

To save memory and improve processing speed, we also convert the values of the 'internal' column to the boolean type, since they are actually boolean.

In [23]:
calls['internal'] = calls['internal'].convert_dtypes(convert_boolean=True)
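The memory payoff of this conversion can be verified directly with `memory_usage(deep=True)`. A minimal sketch on synthetic data (the column below is hypothetical, not the real 'internal' column):

```python
import pandas as pd

# hypothetical column of 100,000 flags stored as Python objects
df = pd.DataFrame({'internal': [True, False] * 50_000}).astype({'internal': 'object'})

before = df['internal'].memory_usage(deep=True)

# convert to pandas' nullable boolean dtype (1 value byte + 1 mask byte per entry)
df['internal'] = df['internal'].astype('boolean')

after = df['internal'].memory_usage(deep=True)
```

The object column stores a pointer per entry plus the boxed values, while the 'boolean' dtype stores two bytes per entry, so `after` is substantially smaller than `before`.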
Converting values in 'operator_id' column in the table 'calls' from 'float' into 'integer' format
In [24]:
calls['operator_id'] = calls['operator_id'].astype('int')
In [25]:
# checking the results

display(calls.head())
calls.info()
user_id date direction internal operator_id is_missed_call calls_count call_duration total_call_duration
0 166377 2019-08-05 out True 880022 True 3 0 5
1 166377 2019-08-05 out True 880020 True 1 0 1
2 166377 2019-08-05 out True 880020 False 1 10 18
3 166377 2019-08-05 out False 880022 True 3 0 25
4 166377 2019-08-05 out False 880020 False 2 3 29
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44903 entries, 0 to 44902
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   user_id              44903 non-null  int64         
 1   date                 44903 non-null  datetime64[ns]
 2   direction            44903 non-null  object        
 3   internal             44903 non-null  boolean       
 4   operator_id          44903 non-null  int64         
 5   is_missed_call       44903 non-null  bool          
 6   calls_count          44903 non-null  int64         
 7   call_duration        44903 non-null  int64         
 8   total_call_duration  44903 non-null  int64         
dtypes: bool(1), boolean(1), datetime64[ns](1), int64(5), object(1)
memory usage: 2.5+ MB
In [26]:
display(clients.head())
clients.info()
user_id tariff_plan date_start
0 166713 A 2019-08-15
1 166901 A 2019-08-23
2 168527 A 2019-10-29
3 167097 A 2019-09-01
4 168193 A 2019-10-16
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 732 entries, 0 to 731
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   user_id      732 non-null    int64 
 1   tariff_plan  732 non-null    object
 2   date_start   732 non-null    object
dtypes: int64(1), object(2)
memory usage: 17.3+ KB
Converting data type of values of the column 'date_start' in the table 'clients' from object into datetime
In [27]:
# converting 'object' into 'datetime'
clients['date_start'] = pd.to_datetime(clients['date_start'])
In [28]:
display(clients.head())
clients.info()
user_id tariff_plan date_start
0 166713 A 2019-08-15
1 166901 A 2019-08-23
2 168527 A 2019-10-29
3 167097 A 2019-09-01
4 168193 A 2019-10-16
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 732 entries, 0 to 731
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   user_id      732 non-null    int64         
 1   tariff_plan  732 non-null    object        
 2   date_start   732 non-null    datetime64[ns]
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 17.3+ KB

2.3. Checking the dataset for duplicates.

Checking 'calls'
In [29]:
calls.duplicated().sum()
Out[29]:
4103

We found duplicates, so we need to drop them

In [30]:
calls = calls.drop_duplicates().reset_index(drop=True)

# checking the results
print(calls.duplicated().sum())
calls.info()
0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40800 entries, 0 to 40799
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   user_id              40800 non-null  int64         
 1   date                 40800 non-null  datetime64[ns]
 2   direction            40800 non-null  object        
 3   internal             40800 non-null  boolean       
 4   operator_id          40800 non-null  int64         
 5   is_missed_call       40800 non-null  bool          
 6   calls_count          40800 non-null  int64         
 7   call_duration        40800 non-null  int64         
 8   total_call_duration  40800 non-null  int64         
dtypes: bool(1), boolean(1), datetime64[ns](1), int64(5), object(1)
memory usage: 2.3+ MB
Checking 'clients'
In [31]:
clients.duplicated().sum()
Out[31]:
0

No duplicates, so there are no more problems here.

Conclusions

We have checked all the columns of the two datasets for missing values and found missing values related to the operator ids; such values make up about 15% of all data. We do not know why these values are missing, but it looks like there was a technical issue. In any case, since we need them to evaluate the efficiency of the operators (the main goal of the study), we had to remove them from the dataset. The missing values in the 'internal' column are of the same kind but less important, and we removed them as well.

Moreover, we found probably corrupted data in the columns 'calls_count', 'call_duration' and 'total_call_duration', so we removed the suspicious rows as well.

We found no other issues with missing values. We converted the data types to appropriate ones (dates to datetime, float operator ids to integers). We also found duplicates in the dataset 'calls', which we removed as well.

Step 3: Carrying out EDA

3.1. Studying how often operators missed calls.

  • Finding the ratio of missed calls to all calls received by each operator
In [32]:
calls_by_operator = calls.query('direction == "in"').pivot_table(index = 'operator_id', columns='is_missed_call', values ='direction', aggfunc ='count')
calls_by_operator.head()
Out[32]:
is_missed_call False True
operator_id
879896 21.0 NaN
879898 56.0 NaN
880020 7.0 NaN
880022 8.0 NaN
880026 18.0 NaN
In [33]:
# the reviewer's code: 

calls_by_operator = ((calls
                         .query('direction == "in"')
                         .groupby(['operator_id']) 
                         .agg(total_calls=('calls_count', 'sum'), 
                              days=('date', 'nunique'))
                         .reset_index()
                         .merge((calls.query('direction == "in" & is_missed_call == True')
                                        .groupby(['operator_id'])
                                        .agg(mised_calls=('calls_count', 'sum')))
                                         .reset_index(), on = "operator_id", how = 'left'))
                        .fillna(0).sort_values(by = 'mised_calls', ascending = False)
                     )
    

calls_by_operator.head() 
Out[33]:
operator_id total_calls days mised_calls
289 913942 2340 57 52.0
598 944226 180 36 30.0
570 940588 1702 29 30.0
22 885890 1244 73 26.0
527 937956 773 44 24.0
In [34]:
# calculating the share of missed calls to all calls

calls_by_operator['share_missed'] = calls_by_operator['mised_calls']/calls_by_operator['total_calls']

# checking the results

calls_by_operator.head()
Out[34]:
operator_id total_calls days mised_calls share_missed
289 913942 2340 57 52.0 0.022222
598 944226 180 36 30.0 0.166667
570 940588 1702 29 30.0 0.017626
22 885890 1244 73 26.0 0.020900
527 937956 773 44 24.0 0.031048
  • Plotting a histogram showing the distribution of the said ratios, finding trends and outliers
In [35]:
fig = px.histogram(calls_by_operator, x="share_missed", 
                   marginal="box", # or violin, rug
                   hover_data=calls_by_operator.columns)

fig.update_layout(
    title = 'Distribution of values related to share of missed calls to all incoming calls ',
    xaxis_title = "Share of missed calls to all incoming calls",
    yaxis_title = "Number of operators",
    font=dict(
        family="Arial",
        size=12,
        color="RebeccaPurple"
    )
    )
fig.show()

The histogram shows that most operators have a relatively low share of missed calls; the upper fence for outliers according to the boxplot method is 1.7%. But the number of outliers is quite high, so let's see what this upper limit, and a slightly higher one, mean in terms of the share of such operators among all operators.

In [36]:
len(calls_by_operator.query('share_missed >= 0.017'))/len(calls_by_operator)
Out[36]:
0.17639257294429708

About 17% of operators have a share of missed calls exceeding the 1.7% limit. That's quite a lot, so let's look at the distribution of these values more closely.
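The 1.7% fence quoted above comes from the boxplot marginal; it can also be computed directly with the Tukey rule (Q3 + 1.5 × IQR). A minimal sketch on made-up shares, where the real input would be `calls_by_operator['share_missed']`:

```python
import pandas as pd

def boxplot_upper_fence(s: pd.Series) -> float:
    """Upper fence of the Tukey boxplot rule: Q3 + 1.5 * IQR."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    return q3 + 1.5 * (q3 - q1)

# illustrative data: most operators miss few calls, one misses many
shares = pd.Series([0.0, 0.0, 0.0, 0.005, 0.007, 0.01, 0.02, 0.15])

fence = boxplot_upper_fence(shares)          # cutoff above which points are "outliers"
share_above = (shares > fence).mean()        # fraction of operators beyond the fence
```

The same `boxplot_upper_fence` call on the real column reproduces the value plotly draws as the "upper fence" in the marginal boxplot.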

  • Plotting density histograms showing the distribution of the said ratios
In [37]:
# sns.distplot is deprecated, so we draw the same density curve with kdeplot
sns.kdeplot(calls_by_operator['share_missed'], linewidth=3, label='share_missed')
plt.title("Density plot for distribution of shares of missed calls to all calls", fontsize=12)
plt.xlabel("Share of missed calls to all calls")
plt.ylabel("Density")
plt.show()

We see that the density (probability) of the share of missed calls drops sharply around 8-10%. Let's see what that means in terms of the share of operators, and find the mean, quartiles and upper percentiles.

  • Finding the mean, the median and quartiles of ratio values
In [38]:
calls_by_operator['share_missed'].describe()
Out[38]:
count    754.000000
mean       0.017637
std        0.065322
min        0.000000
25%        0.000000
50%        0.000000
75%        0.007022
max        1.000000
Name: share_missed, dtype: float64
In [39]:
# finding the 95th percentile
np.percentile(calls_by_operator['share_missed'], 95)
Out[39]:
0.09142786561264825
In [40]:
# finding the 90th percentile
np.percentile(calls_by_operator['share_missed'], 90)
Out[40]:
0.037037037037037035
In [41]:
# finding the 85th percentile
np.percentile(calls_by_operator['share_missed'], 85)
Out[41]:
0.0213359323243133
In [42]:
# finding the 80th percentile
np.percentile(calls_by_operator['share_missed'], 80)
Out[42]:
0.013074388763655256
  • Determining the threshold for a "high" ratio of missed calls to all calls received by an operator

It's hard to define the rate at which inefficiency begins in this case: there are a lot of "outliers", but their sheer number suggests they may in fact be "normal", i.e. not outliers but a trend. Looking at the percentile values, the share of missed calls grows drastically after the 85th percentile, while the 85th percentile differs from the 80th by less than the 90th differs from the 85th. So the 85th percentile looks like a good threshold for real outliers whichever approach we use, and we choose it as the efficiency threshold.
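The percentile-gap reasoning above can be made explicit in code. A small sketch on synthetic data (the made-up distribution only illustrates the idea; the notebook's input is `calls_by_operator['share_missed']`):

```python
import numpy as np

# illustrative distribution: many zeros plus a long right tail of missed-call shares
rng = np.random.default_rng(42)
share_missed = np.concatenate([np.zeros(500), rng.exponential(0.03, 150)])

# the gap between consecutive percentile values; where it starts to jump,
# the "normal" range ends and the heavy tail begins
pcts = [75, 80, 85, 90, 95]
values = np.percentile(share_missed, pcts)
gaps = np.diff(values)
for lo, hi, gap in zip(pcts, pcts[1:], gaps):
    print(f"{lo}th -> {hi}th percentile: +{gap:.4f}")
```

Choosing the percentile just before the largest jump formalizes the "85th percentile" decision made by eye above.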

In [43]:
# setting up a threshold as variable for future needs

missed_calls_thrld = np.percentile(calls_by_operator['share_missed'], 85)

3.2. Studying distribution of waiting time values for incoming calls.

  • Filtering the data to incoming calls only, finding the waiting time for each call and storing it in a separate column
In [44]:
# creating a subset with only incoming calls and non-missed calls

incoming_calls = calls.query('direction == "in" and is_missed_call == False')
incoming_calls = incoming_calls.reset_index(drop=True)
incoming_calls.head()
Out[44]:
user_id date direction internal operator_id is_missed_call calls_count call_duration total_call_duration
0 166377 2019-08-12 in False 880028 False 1 407 411
1 166377 2019-08-13 in False 880028 False 1 88 102
2 166377 2019-08-14 in False 880026 False 2 197 218
3 166377 2019-08-14 in False 880028 False 1 33 37
4 166377 2019-08-15 in False 880028 False 1 23 27
In [45]:
# calculating waiting time for calls made on each date by each operator 

incoming_calls['wait_time'] = incoming_calls['total_call_duration'] - incoming_calls['call_duration']

# checking the results

display(incoming_calls.sample(10))
display(incoming_calls.info())
user_id date direction internal operator_id is_missed_call calls_count call_duration total_call_duration wait_time
2816 166800 2019-11-07 in False 892530 False 1 25 59 34
3500 166916 2019-09-27 in False 906396 False 9 1278 1612 334
10972 168187 2019-11-15 in False 937810 False 3 328 354 26
4683 167004 2019-11-08 in False 900826 False 1 149 163 14
3115 166879 2019-09-29 in False 896536 False 2 25 62 37
1426 166582 2019-09-21 in False 885876 False 4 230 255 25
9083 167828 2019-11-07 in False 940430 False 4 249 253 4
7228 167445 2019-11-14 in False 920728 False 7 1942 2264 322
6201 167176 2019-11-15 in False 905104 False 3 82 137 55
10899 168187 2019-11-12 in False 937902 False 10 785 943 158
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11953 entries, 0 to 11952
Data columns (total 10 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   user_id              11953 non-null  int64         
 1   date                 11953 non-null  datetime64[ns]
 2   direction            11953 non-null  object        
 3   internal             11953 non-null  boolean       
 4   operator_id          11953 non-null  int64         
 5   is_missed_call       11953 non-null  bool          
 6   calls_count          11953 non-null  int64         
 7   call_duration        11953 non-null  int64         
 8   total_call_duration  11953 non-null  int64         
 9   wait_time            11953 non-null  int64         
dtypes: bool(1), boolean(1), datetime64[ns](1), int64(6), object(1)
memory usage: 782.2+ KB
None
  • Finding average waiting time for all calls received by each operator
In [46]:
# creating a subset where we store the data on average waiting time for each operator

calls_by_operator_wait_time = (incoming_calls
                               .groupby('operator_id')
                               .agg(total_calls=('calls_count','sum'), 
                                    total_wait_time =('wait_time', 'sum'))
                               .reset_index()
                              )
calls_by_operator_wait_time['avg_per_call'] = (calls_by_operator_wait_time['total_wait_time']
                                               /calls_by_operator_wait_time['total_calls'])

# checking the results

calls_by_operator_wait_time.sample(10)
Out[46]:
operator_id total_calls total_wait_time avg_per_call
180 904058 19 311 16.368421
426 929626 35 438 12.514286
145 901038 5 114 22.800000
45 890402 539 5559 10.313544
523 937880 7 65 9.285714
52 891254 1 26 26.000000
18 884478 22 323 14.681818
112 898558 22 622 28.272727
67 892532 65 1863 28.661538
204 905932 36 288 8.000000
  • Plotting a histogram of waiting time values for calls received by operators
In [47]:
fig = px.histogram(calls_by_operator_wait_time, x="avg_per_call", 
                   marginal="box", # or violin, rug
                   hover_data=calls_by_operator_wait_time.columns)

fig.update_layout(
    title = 'Distribution of values related to average waiting time ',
    xaxis_title = "Average waiting time",
    yaxis_title = "Number of operators",
    font=dict(
        family="Arial",
        size=12,
        color="RebeccaPurple"
    )
    )
fig.show()

The upper fence for identifying outliers is 39.6 seconds. Looking at the histogram, this seems fair: there are outliers, and they are clearly out of the trend. But let's look at the share of such operators.

In [48]:
len(calls_by_operator_wait_time.query('avg_per_call > 39.6'))/len(calls_by_operator_wait_time)
Out[48]:
0.06374501992031872

It is only about 6 percent of operators, so most operators are "close" to each other in terms of average waiting time.

  • Determining the threshold for long waiting time calls

It seems that the upper fence for outliers according to the boxplot method is a good way to determine where inefficiency begins: the outliers it identifies are clearly out of the trend and less efficient than the rest. So we fix the threshold accordingly.

In [49]:
# setting up the threshold for long waiting time calls

long_wait_time__thrld = 39.6
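The 39.6-second threshold above was read off the boxplot marginal; it can also be derived directly from the data with the same Tukey fence rule. A minimal sketch on hypothetical waiting times (in the notebook the input would be `calls_by_operator_wait_time['avg_per_call']`):

```python
import pandas as pd

# hypothetical average waiting times per operator, in seconds
avg_wait = pd.Series([5, 8, 9, 10, 12, 14, 15, 18, 22, 28, 45, 120])

# Tukey upper fence: Q3 + 1.5 * IQR
q1, q3 = avg_wait.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)

# fraction of operators whose waiting time exceeds the fence
outlier_share = (avg_wait > upper_fence).mean()
```

Deriving the fence in code keeps the threshold reproducible if the data is refreshed, instead of baking a number read from a chart into the notebook.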

3.3. Studying distribution of numbers of outgoing calls.

  • Finding average number of calls made per day by an operator
In [50]:
calls.head()
Out[50]:
user_id date direction internal operator_id is_missed_call calls_count call_duration total_call_duration
0 166377 2019-08-05 out True 880022 True 3 0 5
1 166377 2019-08-05 out True 880020 True 1 0 1
2 166377 2019-08-05 out True 880020 False 1 10 18
3 166377 2019-08-05 out False 880022 True 3 0 25
4 166377 2019-08-05 out False 880020 False 2 3 29
In [51]:
calls.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40800 entries, 0 to 40799
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   user_id              40800 non-null  int64         
 1   date                 40800 non-null  datetime64[ns]
 2   direction            40800 non-null  object        
 3   internal             40800 non-null  boolean       
 4   operator_id          40800 non-null  int64         
 5   is_missed_call       40800 non-null  bool          
 6   calls_count          40800 non-null  int64         
 7   call_duration        40800 non-null  int64         
 8   total_call_duration  40800 non-null  int64         
dtypes: bool(1), boolean(1), datetime64[ns](1), int64(5), object(1)
memory usage: 2.3+ MB

We first need to define whether a given operator is supposed to make outgoing calls.

Let's filter the data by the 'internal' field (the calls must be external) and by the 'direction' field (the calls must be outgoing).

In [52]:
# creating a subset with filtered the data by 'direction' field and 'internal' field

outgoing_calls = (calls[(calls["internal"] == False) & (calls["direction"] == "out")]).reset_index(drop=True)

# checking the results
display(outgoing_calls.head())
outgoing_calls.info()
user_id date direction internal operator_id is_missed_call calls_count call_duration total_call_duration
0 166377 2019-08-05 out False 880022 True 3 0 25
1 166377 2019-08-05 out False 880020 False 2 3 29
2 166377 2019-08-05 out False 880020 True 8 0 50
3 166377 2019-08-06 out False 881278 True 3 0 29
4 166377 2019-08-06 out False 880020 True 5 0 70
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23219 entries, 0 to 23218
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   user_id              23219 non-null  int64         
 1   date                 23219 non-null  datetime64[ns]
 2   direction            23219 non-null  object        
 3   internal             23219 non-null  boolean       
 4   operator_id          23219 non-null  int64         
 5   is_missed_call       23219 non-null  bool          
 6   calls_count          23219 non-null  int64         
 7   call_duration        23219 non-null  int64         
 8   total_call_duration  23219 non-null  int64         
dtypes: bool(1), boolean(1), datetime64[ns](1), int64(5), object(1)
memory usage: 1.3+ MB
In [53]:
# grouping data by operator and averaging the daily call counts

outgoing_calls_operator = outgoing_calls.groupby('operator_id').agg(calls_avg_n=('calls_count','mean'))

# checking the results

outgoing_calls_operator.head(20)
Out[53]:
calls_avg_n
operator_id
879896 8.575000
879898 44.158940
880020 3.400000
880022 3.222222
880026 14.490066
880028 16.162338
880240 5.857143
881278 2.400000
882684 6.445545
882686 19.173333
882688 6.142857
882690 35.400000
883018 2.000000
883898 1.000000
883940 20.387097
883942 17.955752
884402 1.000000
884408 1.800000
884412 8.077778
884478 32.371429
  • Plotting a histogram showing the distribution of average numbers of outgoing calls across operators, finding trends and outliers
In [54]:
fig = px.histogram(outgoing_calls_operator, x="calls_avg_n", 
                   marginal="box",
                   hover_data=outgoing_calls_operator.columns)

fig.update_layout(
    title = 'Distribution of average numbers of outgoing calls per day and per operator',
    xaxis_title = "Average number of outgoing calls per day and per operator",
    yaxis_title = "Number of operators",
    font=dict(
        family="Arial",
        size=12,
        color="RebeccaPurple"
    )
    )
fig.show()
In [55]:
outgoing_calls_operator["calls_avg_n"].describe()
Out[55]:
count    833.000000
mean      13.321047
std       18.751722
min        1.000000
25%        2.000000
50%        4.018182
75%       15.750000
max       92.277778
Name: calls_avg_n, dtype: float64

The picture is interesting. Many operators made a really small number of calls, and it's unlikely that all of them are inefficient. Probably they are simply not supposed to make such calls; perhaps they make them occasionally alongside incoming calls. In that case we cannot say they are ineffective.

Let's see whether the operators making outgoing calls also receive incoming calls, and if so, how many they receive compared to the outgoing calls they make.

In [56]:
# creating a subset where we calculate incoming calls and outgoing calls 

outgoing_callers= ((calls.query('internal == False & direction == "out"')
                        .groupby('operator_id')
                        .agg(total_calls_out=('calls_count','sum'), 
                            avg_calls_out=('calls_count', 'mean'))
                        .reset_index())
                    .merge((calls.query('internal == False & direction == "in"')
                        .groupby('operator_id')
                        .agg(total_calls_in=('calls_count','sum'),
                            avg_calls_in=('calls_count', 'mean'))
                        .reset_index()), on='operator_id', how='left')
                    .fillna(0)
                    )

# checking the results

outgoing_callers.head(10)
Out[56]:
operator_id total_calls_out avg_calls_out total_calls_in avg_calls_in
0 879896 686 8.575000 58.0 2.761905
1 879898 6668 44.158940 103.0 1.839286
2 880020 34 3.400000 7.0 1.000000
3 880022 174 3.222222 8.0 1.000000
4 880026 2188 14.490066 24.0 1.333333
5 880028 2489 16.162338 63.0 1.536585
6 880240 41 5.857143 0.0 0.000000
7 881278 12 2.400000 0.0 0.000000
8 882684 651 6.445545 613.0 12.260000
9 882686 2876 19.173333 766.0 11.264706

Let's see how many operators received no incoming calls at all

In [57]:
len(outgoing_callers.query('total_calls_in == 0'))
Out[57]:
317

Not a lot. Let's see which operators made more outgoing calls than incoming ones

In [58]:
# finding a ratio of average number of incoming calls to outgoing calls

outgoing_callers['ratio_avg_in_out'] = (outgoing_callers['avg_calls_in']
                                 /outgoing_callers['avg_calls_out'])

# checking the results
outgoing_callers.sample(10)
Out[58]:
operator_id total_calls_out avg_calls_out total_calls_in avg_calls_in ratio_avg_in_out
126 899788 600 16.666667 0.0 0.000000 0.000000
532 937788 472 8.740741 121.0 4.653846 0.532432
255 910532 151 8.882353 0.0 0.000000 0.000000
171 902774 1099 23.382979 0.0 0.000000 0.000000
480 932502 755 25.166667 4.0 1.000000 0.039735
479 932500 1167 26.522727 8.0 1.333333 0.050271
567 939222 873 14.550000 34.0 1.619048 0.111275
454 930186 53 3.533333 5.0 1.250000 0.353774
727 951648 161 40.250000 0.0 0.000000 0.000000
507 937352 27 2.454545 0.0 0.000000 0.000000
In [59]:
outgoing_callers['ratio_avg_in_out'].describe()
Out[59]:
count    833.000000
mean       0.915158
std        3.187422
min        0.000000
25%        0.000000
50%        0.303904
75%        0.950635
max       51.388889
Name: ratio_avg_in_out, dtype: float64
In [60]:
len(outgoing_callers.query('ratio_avg_in_out <= 1'))
Out[60]:
662

That's more operators. They are clearly supposed to make outgoing calls, since they make more of them than incoming ones. So we filter the subset down to these operators and then find the efficiency threshold.

In [61]:
outgoing_callers = outgoing_callers.query('ratio_avg_in_out <= 1')
outgoing_callers = outgoing_callers.reset_index(drop=True)

# checking the results
outgoing_callers.sample(10)
Out[61]:
operator_id total_calls_out avg_calls_out total_calls_in avg_calls_in ratio_avg_in_out
421 937868 250 4.901961 61.0 2.772727 0.565636
122 900792 5 1.666667 1.0 1.000000 0.600000
647 969386 53 7.571429 0.0 0.000000 0.000000
19 885876 2157 30.814286 992.0 11.022222 0.357698
448 939376 1075 17.063492 261.0 8.700000 0.509860
164 905844 1032 9.381818 13.0 1.083333 0.115472
284 919906 1495 46.718750 0.0 0.000000 0.000000
292 920838 171 3.562500 65.0 2.708333 0.760234
596 955820 9 1.500000 0.0 0.000000 0.000000
260 919202 2433 67.583333 0.0 0.000000 0.000000
  • Looking at the distribution of average number of outgoing calls once again
In [62]:
fig = px.histogram(outgoing_callers, x="avg_calls_out", 
                   marginal="box",
                   hover_data=outgoing_callers.columns)

fig.update_layout(
    title = 'Distribution of average numbers of outgoing calls per day and per operator',
    xaxis_title = "Average number of outgoing calls per day and per operator",
    yaxis_title = "Number of operators",
    font=dict(
        family="Arial",
        size=12,
        color="RebeccaPurple"
    )
    )
fig.show()
In [63]:
outgoing_callers['avg_calls_out'].describe()
Out[63]:
count    662.000000
mean      15.809411
std       20.163150
min        1.000000
25%        2.358021
50%        6.052632
75%       21.053571
max       92.277778
Name: avg_calls_out, dtype: float64
In [64]:
# finding the 40th percentile
np.percentile(outgoing_callers['avg_calls_out'], 40)
Out[64]:
3.6549019607843145

In fact, even after filtering the data in quite a sophisticated way, the picture has not changed much: a great part of operators make a really small number of calls. But maybe that is simply the case: there is also a good portion of operators who make a relatively higher number of calls; they seem to be effective and the others do not.

  • Determining the threshold for a low number of calls

Taking into account that the clearly efficient part of operators made more than 10 calls per day, we find it appropriate to set the threshold at 6 calls per day.

In [65]:
# setting up the threshold for number of outgoing calls

outgoing_calls_thrld = 6
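Since the 6-calls cutoff is a judgment call, it helps to check how many operators each candidate threshold would flag. A hedged sketch on made-up daily averages (the real input would be `outgoing_callers['avg_calls_out']`):

```python
import pandas as pd

# hypothetical average numbers of outgoing calls per day, one value per operator
avg_calls_out = pd.Series([1, 1.5, 2, 2.4, 3, 3.7, 5, 6.1, 8, 12, 21, 47, 92])

# share of operators that would be flagged as ineffective at each candidate threshold
for thr in (4, 6, 8, 10):
    flagged = (avg_calls_out < thr).mean()
    print(f"threshold {thr}: {flagged:.0%} of operators flagged")
```

Running this on the real column shows directly how sensitive the "ineffective" group size is to the chosen cutoff.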

Step 4: Identifying Ineffective Operators

  • Classifying the operators as effective and ineffective based on the thresholds for the metrics found during the EDA
In [66]:
# creating a dataframe with operator id of all operators

operators_series = pd.Series(calls['operator_id'].unique())

operators = pd.DataFrame(operators_series)

operators = operators.rename(columns={0:'operator_id'})

# checking the results

display(operators.head())
display(operators.tail())
operators.info()
operator_id
0 880022
1 880020
2 881278
3 880028
4 880026
operator_id
1087 958394
1088 952914
1089 952916
1090 959118
1091 957922
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1092 entries, 0 to 1091
Data columns (total 1 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   operator_id  1092 non-null   int64
dtypes: int64(1)
memory usage: 8.7 KB
In [67]:
# creating a list of ineffective operators based on the threshold for ratio of missed calls 

missed_calls_list_ineffective = (calls_by_operator.query('share_missed > @missed_calls_thrld')['operator_id'].unique())

# creating a list of ineffective operators based on the threshold for average waiting time

long_wait_time_list_ineffective = (calls_by_operator_wait_time
                                   .query('avg_per_call > @long_wait_time__thrld')['operator_id'].unique())


# creating a list of ineffective operators based on the threshold for average number of outgoing calls

small_number_outgoing_calls_ineffective = (outgoing_callers
                                           .query('avg_calls_out < @outgoing_calls_thrld')['operator_id'].unique())

Let's find the length of each list.

In [68]:
len(missed_calls_list_ineffective)
Out[68]:
113
In [69]:
len(long_wait_time_list_ineffective)
Out[69]:
48
In [70]:
len(small_number_outgoing_calls_ineffective)
Out[70]:
329
In [71]:
# creating a function returning a label for an operator id

def efficiency_classifier(operator_id):
    # ineffective if (many missed calls AND long waits) OR too few outgoing calls
    if ((operator_id in missed_calls_list_ineffective
         and operator_id in long_wait_time_list_ineffective)
            or operator_id in small_number_outgoing_calls_ineffective):
        return 'ineffective'
    else:
        return 'effective'
In [72]:
# testing if the function works 
efficiency_classifier(958394)
Out[72]:
'effective'
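The same labels can be produced without a row-wise `apply` by combining vectorized `isin` masks, which scales better on large tables. A sketch with illustrative operator ids standing in for the three lists built above:

```python
import numpy as np
import pandas as pd

# illustrative stand-ins for the three ineffective-operator lists
missed_calls_list_ineffective = np.array([101, 102])
long_wait_time_list_ineffective = np.array([102, 103])
small_number_outgoing_calls_ineffective = np.array([104])

operators = pd.DataFrame({'operator_id': [101, 102, 103, 104, 105]})

# (missed AND long wait) OR too few outgoing calls, as boolean masks
ineffective = (
    (operators['operator_id'].isin(missed_calls_list_ineffective)
     & operators['operator_id'].isin(long_wait_time_list_ineffective))
    | operators['operator_id'].isin(small_number_outgoing_calls_ineffective)
)
operators['efficiency'] = np.where(ineffective, 'ineffective', 'effective')
```

With the real lists in place, this produces the same `efficiency` column as the `apply`-based cell below.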
In [73]:
# applying the function to operator ids and getting a value indicating whether they are effective

operators['efficiency']=operators['operator_id'].apply(lambda x: efficiency_classifier(x))
operators.head()
Out[73]:
operator_id efficiency
0 880022 ineffective
1 880020 ineffective
2 881278 ineffective
3 880028 effective
4 880026 effective
In [74]:
# looking at the split of the operators into two groups

operators['efficiency'].value_counts()
Out[74]:
effective      760
ineffective    332
Name: efficiency, dtype: int64

Let's look at the ratios of the two groups of operators to all operators

In [75]:
# calculating the ratio of effective operators to all operators

effective_operators_rat = len(operators[operators['efficiency'] == 'effective'])/len(operators)
In [76]:
# calculating the ratio of ineffective operators to all operators
1 - effective_operators_rat
Out[76]:
0.30402930402930406

We have finally got a table of effective and ineffective operators. According to our approach, ineffective operators make up around 30% of all operators, which is quite high. But most of them are operators making outgoing calls; that is the issue.

Step 5: Statistical hypotheses testing

5.1. Testing hypothesis that the average number of all calls made or received per day by an operator differs within effective and ineffective operators.

First, we find the average number of calls per day for each operator and then we will use our function once again in order to split the list into two groups.

In [77]:
calls.head()
Out[77]:
user_id date direction internal operator_id is_missed_call calls_count call_duration total_call_duration
0 166377 2019-08-05 out True 880022 True 3 0 5
1 166377 2019-08-05 out True 880020 True 1 0 1
2 166377 2019-08-05 out True 880020 False 1 10 18
3 166377 2019-08-05 out False 880022 True 3 0 25
4 166377 2019-08-05 out False 880020 False 2 3 29
In [78]:
# calculating the number of calls for each day, for each operator

all_calls_by_operator = calls.query('is_missed_call == False').groupby(['operator_id', 'date']).agg({'calls_count':'sum'}).reset_index()

all_calls_by_operator = all_calls_by_operator.groupby('operator_id').agg(n_calls=('calls_count', 'mean'))

all_calls_by_operator = all_calls_by_operator.reset_index()

# checking the results

display(all_calls_by_operator.sample(10))
all_calls_by_operator.info()
operator_id n_calls
315 908960 9.883117
616 932856 3.000000
341 910958 1.444444
175 900354 7.879310
154 898902 3.000000
132 896016 4.387755
1031 968338 22.714286
656 937156 4.111111
385 915360 5.615385
516 924544 3.323529
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1058 entries, 0 to 1057
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   operator_id  1058 non-null   int64  
 1   n_calls      1058 non-null   float64
dtypes: float64(1), int64(1)
memory usage: 16.7 KB
In [79]:
# applying the written function to classify the operators

all_calls_by_operator['efficiency'] = (all_calls_by_operator['operator_id']
                                       .apply(lambda x: efficiency_classifier(x))
                                      )
# checking the results
all_calls_by_operator.head()
Out[79]:
operator_id n_calls efficiency
0 879896 11.680851 effective
1 879898 54.679012 effective
2 880020 1.769231 ineffective
3 880022 2.358974 ineffective
4 880026 20.571429 effective

Now let's see if the average number of calls differs between the two groups

In [80]:
all_calls_efficiency = all_calls_by_operator.groupby('efficiency').agg({'n_calls':'mean'})

all_calls_efficiency
Out[80]:
n_calls
efficiency
effective 18.19456
ineffective 3.29837

The mean values of the two groups differ substantially, but let's see how the average number of calls is distributed within the two groups by looking at a histogram

In [81]:
fig = px.histogram(all_calls_by_operator, x="n_calls", color='efficiency',
                   marginal="box",
                   hover_data=all_calls_by_operator.columns)

fig.update_layout(
    title = 'Average numbers of calls per day within effective and ineffective operators',
    xaxis_title = "Average number of calls per day",
    yaxis_title = "Number of operators",
    font=dict(
        family="Arial",
        size=12,
        color="RebeccaPurple"
    )
    )
fig.show()

We see that there is a difference in the distribution of the values.

Let's test whether this difference is significant using the Mann-Whitney U test. We formulate the null hypothesis (H0) as "there's no difference in the distributions of values" and the alternative hypothesis (H1) as "there's a difference in the distributions of values". We set the significance level at 0.05, which is sufficient here since we are not running multiple comparisons (this is effectively a single test).

In [82]:
# creating two samples

effective_operators_calls_sample = (
    all_calls_by_operator[all_calls_by_operator['efficiency'] =='effective']['n_calls']
)

ineffective_operators_calls_sample = (
    all_calls_by_operator[all_calls_by_operator['efficiency'] =='ineffective']['n_calls']
)
In [83]:
# comparing samples; the two-sided alternative is stated explicitly,
# because older SciPy versions default to a one-sided test
stat, p = mannwhitneyu(effective_operators_calls_sample,
                       ineffective_operators_calls_sample,
                       alternative='two-sided')
print('Statistics=%.3f, p=%.5f' % (stat, p))


# interpreting the result
alpha = 0.05
if p > alpha:
    print('Same distribution (fail to reject H0)')
else:
    print('Different distribution (reject H0)')
Statistics=70447.000, p=0.00000
Different distribution (reject H0)

We see that the distributions are not "equal". So we may say that effective operators make more calls on average; even though this was not one of the efficiency metrics, there is a difference. At the same time, the main part of the ineffective group consists of operators who did not make a "large" number of outgoing calls, so the result of this test may be skewed by that fact.

5.2. Testing hypothesis that the share of internal calls to all calls differs within effective and ineffective operators.

In [84]:
calls.head()
Out[84]:
user_id date direction internal operator_id is_missed_call calls_count call_duration total_call_duration
0 166377 2019-08-05 out True 880022 True 3 0 5
1 166377 2019-08-05 out True 880020 True 1 0 1
2 166377 2019-08-05 out True 880020 False 1 10 18
3 166377 2019-08-05 out False 880022 True 3 0 25
4 166377 2019-08-05 out False 880020 False 2 3 29
In [85]:
# creating a subset with data on number of external and internal calls split by operator

internal_calls_by_operators = calls.pivot_table(columns='internal', index='operator_id', aggfunc='sum', values='calls_count')

internal_calls_by_operators.reset_index(inplace=True)

internal_calls_by_operators = (internal_calls_by_operators
                               .rename_axis(None, axis=1).reset_index(drop=True)
                              )

internal_calls_by_operators = (internal_calls_by_operators
                               .rename(columns={False:'external', True:'internal'})
                              )

internal_calls_by_operators['all_calls'] = (internal_calls_by_operators['external'] 
                                           + internal_calls_by_operators['internal']
                                          )
internal_calls_by_operators['internal_share'] = (internal_calls_by_operators['internal']
                                                /internal_calls_by_operators['all_calls'])

# applying the written function to classify the operators

internal_calls_by_operators['efficiency'] = (internal_calls_by_operators['operator_id']
                                            .apply(lambda x: efficiency_classifier(x) )
                                            )
internal_calls_by_operators.head()
Out[85]:
operator_id external internal all_calls internal_share efficiency
0 879896 744.0 56.0 800.0 0.070000 effective
1 879898 6771.0 18.0 6789.0 0.002651 effective
2 880020 41.0 4.0 45.0 0.088889 ineffective
3 880022 182.0 15.0 197.0 0.076142 ineffective
4 880026 2212.0 20.0 2232.0 0.008961 effective

Now let's look at the mean shares of internal calls within the two groups

In [86]:
internal_calls_by_operators.groupby('efficiency').agg({'internal_share':'mean'})
Out[86]:
internal_share
efficiency
effective 0.114155
ineffective 0.311415

We see that the two groups differ quite substantially in terms of the mean share of internal calls. Let's look at this more closely with the usual histogram and boxplot describing the distributions of the two samples.

In [87]:
fig = px.histogram(internal_calls_by_operators, x="internal_share", color='efficiency',
                   marginal="box",
                   hover_data=internal_calls_by_operators.columns)

fig.update_layout(
    title = 'Distribution of shares of internal calls to all calls made by an operator',
    xaxis_title = "Share of internal calls to all calls made by an operator",
    yaxis_title = "Number of operators",
    font=dict(
        family="Arial",
        size=12,
        color="RebeccaPurple"
    )
    )
fig.show()

We see that the distribution of values differs between the two samples, but we need to find out whether this difference is statistically significant. As above, we'll use the Mann-Whitney U test, and again we formulate the null hypothesis (H0) as "there's no difference in the distributions of values" and the alternative hypothesis (H1) as "there's a difference in the distributions of values".

We set the significance level at 0.05, which is sufficient here as well, since this test is not part of a larger family of comparisons.

In [88]:
# creating two samples

effective_operators_internal_calls_sample = (
    internal_calls_by_operators[internal_calls_by_operators['efficiency'] =='effective']['internal_share']
)

ineffective_operators_internal_calls_sample = (
    internal_calls_by_operators[internal_calls_by_operators['efficiency'] =='ineffective']['internal_share']
)
In [89]:
# comparing samples; the two-sided alternative is stated explicitly,
# because older SciPy versions default to a one-sided test
stat, p = mannwhitneyu(effective_operators_internal_calls_sample,
                       ineffective_operators_internal_calls_sample,
                       alternative='two-sided')
print('Statistics=%.3f, p=%.5f' % (stat, p))


# interpreting the result
alpha = 0.05
if p > alpha:
    print('Same distribution (fail to reject H0)')
else:
    print('Different distribution (reject H0)')
Statistics=106534.000, p=0.00002
Different distribution (reject H0)

As we can see, the test gives us grounds to reject the null hypothesis, so we may say that the difference between the distributions of the two samples is statistically significant.

It's quite an interesting fact: it means that "effective" operators made fewer internal calls. We do not know exactly why operators had to make internal calls. One explanation is that operators make them to ask their colleagues questions because they lack expertise and/or the scripts they use are unclear. If so, this is something to pay attention to, because it may be a key to improving operator performance.

Step 6: General conclusions and recommendations

We have made a retrospective analysis of operator performance in order to find the thresholds we may use to determine which operators are ineffective in terms of the following metrics:

  • a long waiting time for incoming calls,
  • a high share of missed calls among all incoming calls,
  • a small number of outgoing calls.

We have explored the data at hand and determined the following thresholds:

  • for waiting time: 39.6 seconds;
  • for the share of missed calls: the value that separates the 85% of operators with lower shares from the 15% with higher (less effective) ones;
  • the most debatable question is the threshold for the number of outgoing calls. First, it was not easy to understand which operators were supposed to make outgoing calls and which were not. In the end we decided to take into account only those who made more external outgoing calls per day than they received external incoming calls. Still, we found that a really large share of operators made fewer than 6 outgoing calls per day, while others made much more, so we set the threshold accordingly.
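
The three threshold rules above can be combined into a single efficiency flag along these lines (the dataframe, column names, and the 0.38 missed-call cutoff are illustrative assumptions, not the actual values computed from the dataset):

```python
import pandas as pd

# Toy per-operator metrics; the thresholds mirror those derived in the analysis
ops = pd.DataFrame({
    'operator_id': [101, 102, 103],
    'avg_waiting_sec': [25.0, 50.1, 38.0],
    'missed_share': [0.05, 0.40, 0.10],
    'outgoing_per_day': [10, 2, 8],
})

# An operator is flagged as ineffective if any of the three thresholds is breached
ineffective = (
    (ops['avg_waiting_sec'] > 39.6)       # long waiting time for incoming calls
    | (ops['missed_share'] > 0.38)        # hypothetical 85th-percentile cutoff
    | (ops['outgoing_per_day'] < 6)       # too few outgoing calls
)
ops['efficiency'] = ineffective.map({True: 'ineffective', False: 'effective'})
print(ops[['operator_id', 'efficiency']])
```

Using an "any rule breached" flag is the simplest design; a weighted score would let supervisors rank operators instead of splitting them into just two groups.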

It is no surprise that the latter metric accounts for the bulk of ineffective operators. So we suggest evaluating once again whether it is worth using such a metric in our situation. Perhaps it is more appropriate to use some other one (such as the ratio of calls made to the number of calls defined by a specific KPI).

Taking into account these assumptions, we compared the two groups of operators by testing two hypotheses:

  • the number of calls made per day by an operator differs between the two groups,
  • the share of internal calls among all calls differs between the two groups.

We first calculated these metrics and then performed significance tests on the differences between the distributions of the two samples.
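
The per-operator share of internal calls can be computed with a groupby; a minimal sketch, assuming a call log with one row per call and a boolean internal flag (the column names here are hypothetical, not necessarily those of the actual dataset):

```python
import pandas as pd

# Toy call log: one row per call, with a flag marking internal calls
calls = pd.DataFrame({
    'operator_id': [1, 1, 1, 2, 2],
    'internal':    [True, False, False, True, True],
})

# The mean of a boolean column is exactly the share of True values
internal_share = (
    calls.groupby('operator_id')['internal']
         .mean()
         .rename('internal_share')
         .reset_index()
)
print(internal_share)
```

The resulting per-operator shares are what gets fed into the Mann-Whitney comparison of the two efficiency groups.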

The results were quite interesting: we found that the operators differ both in the number of all calls and in the share of internal calls among all calls.

The first outcome is not very reliable, since a great part of the ineffective group consists of operators who do not make enough outgoing calls.

But the latter fact may be a sign of, or a key to, improving operators' performance. We do not know why operators make internal calls, but perhaps they make them because they lack self-confidence and/or the scripts are not very good, so they need to call their colleagues or receive calls from colleagues instead of making and receiving customer calls.

List of sources

  1. https://en.wikipedia.org/wiki/Automatic_call_distributor - a Wikipedia article giving brief information on what automatic call distribution is,
  2. https://towardsdatascience.com/visualize-missing-values-with-missingno-ad4d938b00a1 - an article on the missingno library,
  3. https://stackoverflow.com/questions/16176996/keep-only-date-part-when-using-pandas-to-datetime - how to retrieve the date from a datetime while keeping the datetime format,
  4. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.convert_dtypes.html - how to convert between different data types,
  5. https://plotly.com/python/distplot/ - the plotly distribution plots manual,
  6. https://www.geeksforgeeks.org/creating-a-dataframe-from-pandas-series/ - an article on how to create a Pandas DataFrame from a Series object,
  7. https://searchcustomerexperience.techtarget.com/tip/8-customer-service-metrics-to-measure-call-center-success - an article on some techniques for measuring the productivity of call center agents (operators).